Enhanced Confix Stripping Stemmer and Ants Algorithm for Classifying News Document in Indonesian Language
Abstract
The ants algorithm is a general and flexible method that was first designed for solving optimization problems such as the Traveling Salesman Problem. The analogy between ants finding the shortest path and finding the most similar documents motivated the ant-based text document clustering method. This method consists of two phases: finding the most similar documents (trial phase) and forming clusters (dividing phase). In this paper, we implemented the ant-based document clustering method on 253 news documents in the Indonesian language. In addition, we developed the enhanced confix stripping stemmer as an improvement on the confix stripping stemmer for stemming news documents in the Indonesian language. The experiments showed that the ants algorithm can be applied to the classification of news documents in the Indonesian language, with a best F-measure of 0.86. The experiments also showed that the enhanced confix stripping stemmer successfully solved the confix stripping stemmer's problems and reduced the number of terms by up to 32.66%, while the confix stripping stemmer reduced it by only 30.95%.
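The two phases named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the function names `trial_phase` and `dividing_phase`, the parameters `alpha`, `beta`, `rho`, and `threshold`, the pheromone update rule, and the toy similarity matrix. It follows the general shape of ant colony optimization on a tour problem (ants build an ordering of documents that favours similar neighbours, then the ordering is cut where neighbouring documents are dissimilar), not necessarily the authors' exact method.

```python
import random


def trial_phase(sim, n_ants=10, n_iters=20, alpha=1.0, beta=2.0, rho=0.1, seed=0):
    """Trial phase (assumed form): ants build document orderings; return the
    ordering whose consecutive documents are most similar in total."""
    rng = random.Random(seed)
    n = len(sim)
    pher = [[1.0] * n for _ in range(n)]  # pheromone on each document pair
    best_tour, best_score = None, float("-inf")
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            current = rng.randrange(n)
            tour, unvisited = [current], set(range(n)) - {current}
            while unvisited:
                cands = sorted(unvisited)
                # Attractiveness = pheromone^alpha * similarity^beta.
                weights = [
                    (pher[current][j] ** alpha) * ((sim[current][j] + 1e-9) ** beta)
                    for j in cands
                ]
                current = rng.choices(cands, weights=weights, k=1)[0]
                tour.append(current)
                unvisited.remove(current)
            score = sum(sim[a][b] for a, b in zip(tour, tour[1:]))
            tours.append((tour, score))
            if score > best_score:
                best_tour, best_score = tour, score
        # Evaporate old pheromone, then deposit along each ant's tour.
        for i in range(n):
            for j in range(n):
                pher[i][j] *= 1.0 - rho
        for tour, score in tours:
            for a, b in zip(tour, tour[1:]):
                pher[a][b] += score
                pher[b][a] += score
    return best_tour


def dividing_phase(tour, sim, threshold=0.5):
    """Dividing phase (assumed form): start a new cluster wherever two
    consecutive documents in the ordering fall below the threshold."""
    clusters = [[tour[0]]]
    for prev, cur in zip(tour, tour[1:]):
        if sim[prev][cur] < threshold:
            clusters.append([cur])
        else:
            clusters[-1].append(cur)
    return clusters


# Toy pairwise similarities (hypothetical): documents 0 and 1 are alike,
# as are documents 2 and 3.
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
toy_tour = trial_phase(toy_sim)
toy_clusters = dividing_phase(toy_tour, toy_sim, threshold=0.5)
print(toy_tour, toy_clusters)
```

In practice the similarity matrix would come from the stemmed documents, for example cosine similarity over term weights, which is where a stemmer that reduces the number of distinct terms would matter. The threshold controls how many clusters the dividing phase produces.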
Similar resources
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into predefined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
Lemmatization Technique in Bahasa: Indonesian Language
Many studies and inventions have been made in the field of linguistics and technology. Even so, the integration between linguistics and technology is not always reliable for every language. Every language is unique in its linguistic nature and rules. In this paper, a lemmatization technique in Bahasa (Indonesian language) is presented. It has achieved good precision by using The Indonesian Dict...
Stemming in Tamil for Affix Stripping
Stemming is one of the most important steps in many natural language processing tasks. Stemming reduces inflected words to a common stem or root word. Stemming has mainly been carried out for the English language, because the Tamil language is more complex in structure and, moreover, has critical grammatical rules. Tamil is a Dravidian language, spoken mainly by Tamils. Tamil words have mo...
A Light Weight Stemmer in Kokborok
From the very beginning, stemming has played a significant role in several natural language processing applications such as information retrieval (IR), machine translation (MT), morphological analysis, and deciding the part of speech (POS). Several stemmers have been developed for a large number of languages, including Indian languages; however, no work has been done in Kokborok, a native lan...
An Unsupervised Approach to Develop Stemmer
This paper presents an unsupervised approach for the development of a stemmer (for the case of the Urdu and Marathi languages). During the last few years in particular, a wide range of information in Indian regional languages has been made available on the web in the form of e-data. But access to these data repositories is very low because the efficient search engines/retrieval systems supporting these lang...